35 research outputs found

    rQuant.web: a tool for RNA-Seq-based transcript quantitation

    We provide a novel web service, called rQuant.web, allowing convenient access to tools for quantitative analysis of RNA sequencing data. The underlying quantitation technique rQuant is based on quadratic programming and estimates different biases induced by library preparation, sequencing and read mapping. It can tackle multiple transcripts per gene locus and is therefore particularly well suited to quantify alternative transcripts. rQuant.web is available as a tool in a Galaxy installation at http://galaxy.fml.mpg.de. rQuant.web is free of charge and open to all users, with no login requirement.

    Statistical Tests for Detecting Differential RNA-Transcript Expression from Read Counts

    As a fruit of the current revolution in sequencing technology, transcriptomes can now be analyzed at an unprecedented level of detail. These advances have been exploited for detecting differentially expressed genes across biological samples and for quantifying the abundances of various RNA transcripts within one gene. However, explicit strategies for detecting the hidden differential abundances of RNA transcripts in biological samples have not been defined. In this work, we present two novel statistical tests to address this issue: a 'gene structure sensitive' Poisson test for detecting differential expression when the transcript structure of the gene is known, and a kernel-based test called Maximum Mean Discrepancy (MMD) when it is unknown. We analyzed the proposed approaches on simulated read data for two artificial samples as well as on real reads generated by the Illumina Genome Analyzer for two _C. elegans_ samples. Our analysis shows that the Poisson test identifies genes with differential transcript expression considerably better than previously proposed RNA transcript quantification approaches for this task. The MMD test is able to detect a large fraction (75%) of such differential cases without knowledge of the annotated transcripts. It is therefore well suited to analyzing RNA-Seq experiments when the genome annotations are incomplete or not available, where other approaches necessarily fail.
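
    The Maximum Mean Discrepancy statistic named in the abstract above can be illustrated with a minimal sketch. It computes the biased estimate of the squared MMD between two one-dimensional samples with a Gaussian kernel. The kernel choice, the bandwidth `sigma`, the function names and the use of read start positions as input are all assumptions made for illustration, not details taken from the paper. A complete test would additionally need a null distribution, for example from permutations.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars; sigma is an assumed bandwidth."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


# Hypothetical read start positions within one gene for two samples.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [10.0, 11.0, 12.0, 13.0]

print(mmd_squared(sample_a, sample_a))  # identical samples: no discrepancy
print(mmd_squared(sample_a, sample_b))  # well-separated samples: positive value
```

    A larger value suggests that the two read distributions differ; how large is "large enough" depends on the null distribution, which this sketch does not estimate.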

    Transcript quantification with RNA-Seq data

    Motivation: Novel high-throughput sequencing technologies open exciting new approaches to transcriptome profiling. Sequencing transcript populations of interest, e.g. from different tissues or variable stress conditions, with RNA sequencing (RNA-Seq) [1] generates millions of short reads. Accurately aligned to a reference genome, they provide digital counts and thus facilitate transcript quantification. As the observed read counts only provide the summation of all expressed sequences at one locus, the inference of the underlying transcript abundances is crucial for further quantitative analyses.
    Methods: To approach this problem, we have developed a new technique, called rQuant, based on quadratic programming. Given a gene annotation and position-wise exon/intron read coverage from read alignments, we determine the abundances for each annotated transcript by minimising a suitable loss function. It penalises the deviation of the observed from the expected read coverage given the transcript weights. The observed read coverage is typically non-uniformly distributed over the transcript due to several biases in the generation of the sequencing libraries and the sequencing. This leads to distortions of the transcript abundances, if not corrected properly. We therefore extended our approach to jointly optimise transcript profiles, modeling the coverage deviations depending on the position in the transcript. Our method can be applied without knowledge of the underlying transcript abundances and equally benefits from loci with and without alternative transcripts.
    Results: To quantitatively evaluate the quality of our abundance predictions, we used a set of simulated reads from transcripts with known expression as a benchmark set. It was generated using the Flux Simulator [2] modeling biases in RNA-Seq as well as preparation experiments. Table 1 shows preliminary results with segment- and position-based loss as well as with and without the transcript profiles. Our results indicate that the position-based modeling together with transcript profiles allows us to accurately infer the underlying expression of single transcripts as well as of multiple isoforms of one gene locus.
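
    The core idea in the Methods part above, choosing transcript weights so that the expected coverage matches the observed coverage, can be sketched as a small non-negative least-squares problem. This is an illustrative toy, not the rQuant implementation: the squared loss, the projected-gradient solver, the segment-level design matrix and all names are assumptions, and the bias and transcript-profile modeling described in the abstract is left out.

```python
def quantify(design, coverage, iterations=2000):
    """Estimate non-negative transcript weights w minimising
    0.5 * ||design @ w - coverage||^2 by projected gradient descent.

    design[i][j] is 1 if transcript j covers segment i, else 0 (assumed layout).
    coverage[i] is the observed read coverage of segment i.
    """
    n_seg, n_tx = len(design), len(design[0])
    # The squared Frobenius norm bounds the Lipschitz constant of the gradient,
    # so its inverse is a safe step size.
    step = 1.0 / sum(v * v for row in design for v in row)
    w = [0.0] * n_tx
    for _ in range(iterations):
        residual = [
            sum(design[i][j] * w[j] for j in range(n_tx)) - coverage[i]
            for i in range(n_seg)
        ]
        grad = [
            sum(design[i][j] * residual[i] for i in range(n_seg))
            for j in range(n_tx)
        ]
        # Gradient step followed by projection onto w >= 0.
        w = [max(0.0, w[j] - step * grad[j]) for j in range(n_tx)]
    return w


# Hypothetical locus with three exonic segments and two transcripts:
# transcript 0 covers all segments, transcript 1 skips the middle one.
design = [[1, 1],
          [1, 0],
          [1, 1]]
# Coverage simulated from abundances 10 (transcript 0) and 5 (transcript 1).
coverage = [15.0, 10.0, 15.0]

weights = quantify(design, coverage)
print([round(x, 3) for x in weights])
```

    The middle segment is what makes the two transcripts distinguishable here; without a segment unique to one isoform, the weights would not be identifiable from coverage alone.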

    Oqtans: a Galaxy-integrated workflow for quantitative transcriptome analysis from NGS Data : From Seventh International Society for Computational Biology (ISCB) Student Council Symposium 2011 Vienna, Austria. 15 July 2011

    First published by BioMed Central: Schultheiss, Sebastian J.; Jean, Géraldine; Behr, Jonas; Bohnert, Regina; Drewe, Philipp; Görnitz, Nico; Kahles, André; Mudrakarta, Pramod; Sreedharan, Vipin T.; Zeller, Georg; Rätsch, Gunnar: Oqtans: a Galaxy-integrated workflow for quantitative transcriptome analysis from NGS Data - In: BMC Bioinformatics. - ISSN 1471-2105 (online). - 12 (2011), suppl. 11, art. A7. - doi:10.1186/1471-2105-12-S11-A7

    Computerbasierte Verfahren für Hochdurchsatzgenomik und -transkriptomik (Computational methods for high-throughput genomics and transcriptomics)

    The completion of genome sequences for many species, including humans and a number of model organisms, was considered a major milestone at the turn of the millennium. It was quickly realised that focusing on a single reference genome per species is insufficient to understand the diversity within and between organisms. Instead, a multitude of genome sequences per species is required to give insight into causal sequences for variable traits. The advent of high-throughput technologies such as next-generation sequencing has undoubtedly accelerated sequencing and allowed for many exciting large-scale studies in genetics that were not previously conceivable. Genome-wide association studies, in which the linkage of sequence and phenotype variations is investigated, have profited greatly from the recent technology development. For these kinds of studies, it is indispensable to measure genotypes and relevant traits for a large set of individuals. Because such data is of immense quantity and often noisy, computational approaches are required to analyse data from next-generation genetics. In the context of my thesis, I have contributed to the analysis of biological high-throughput data in two respects. In order to accurately describe genotypes, I have designed efficient large-scale tools to identify and catalogue polymorphisms from array data. Moreover, I have developed approaches for the estimation of transcript abundances from next-generation sequencing data, enabling precise analyses of transcriptomes. The first part of my thesis focuses on the analysis of array-based resequencing data that was obtained to describe sequence variation across 20 diverse varieties of domesticated rice. I applied sophisticated machine learning methods for efficient and accurate analysis of this enormous set of hybridisation data.
Using an approach based on Support Vector Machines, I uncovered more than 300,000 non-redundant single-nucleotide polymorphisms, which were found to be highly accurate when assessed against a gold standard set of polymorphisms. For the detection of complex regions of polymorphisms, I employed a second machine learning method based on Hidden Markov Support Vector Machines, revealing between 65,000 and 203,000 polymorphic regions across varieties and complementing the SNP set derived with the SVM-based approach. Altogether, detecting hundreds of thousands of polymorphisms on a genome-wide scale has enabled the assembly of the first whole genome set of polymorphisms for the world's most important crop plant. In the second part of my dissertation, I address the question of accurate quantification of transcriptomes from RNA sequencing measurements. For this purpose, I developed a novel computational method that uses techniques from machine learning and optimisation. In particular, this tool, rQuant, infers the abundance of alternative transcripts and simultaneously estimates the effect of biases induced by experimental settings. Quantifying transcripts from artificial as well as experimental data sets demonstrated the superiority of rQuant in an evaluation for diverse settings and a comparison against other transcript quantification tools. Moreover, I adapted ideas of rQuant to develop a tool for quantitative deconvolution of RNA secondary structures. rQuant is available to the community as open-source software and as a web service. In conclusion, my thesis contributes to key parts of research in high-throughput genomics and transcriptomics. This work will facilitate the identification of genotype and phenotype linkage and will improve our understanding of the biological processes that make individuals unique.

    Comprehensive benchmarking of SNV callers for highly admixed tumor data.

    Precision medicine attempts to individualize cancer therapy by matching tumor-specific genetic changes with effective targeted therapies. A crucial first step in this process is the reliable identification of cancer-relevant variants, which is considerably complicated by the impurity and heterogeneity of clinical tumor samples. We compared the impact of admixture of non-cancerous cells and low somatic allele frequencies on the sensitivity and precision of 19 state-of-the-art SNV callers. We studied both whole exome and targeted gene panel data and up to 13 distinct parameter configurations for each tool. We found vast differences among callers. Based on our comprehensive analyses we recommend joint tumor-normal calling with MuTect, EBCall or Strelka for whole exome somatic variant calling, and HaplotypeCaller or FreeBayes for whole exome germline calling. For targeted gene panel data on a single tumor sample, LoFreqStar performed best. We further found that tumor impurity and admixture had a negative impact on precision and, in particular, on sensitivity in whole exome experiments. At admixture levels of 60% to 90%, as sometimes seen in pathological biopsies, sensitivity dropped significantly, even when variants were originally present in the tumor at 100% allele frequency. Sensitivity to low-frequency SNVs improved with targeted panel data, but whole exome data allowed more efficient identification of germline variants. Effective somatic variant calling requires high-quality pathological samples with minimal admixture, a consciously selected sequencing strategy, and the appropriate variant calling tool with settings optimized for the chosen type of data.
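
    The two metrics used to compare callers in the abstract above, sensitivity and precision, can be computed from a call set and a truth set as in the following minimal sketch. The variant representation as `(chromosome, position, ref, alt)` tuples, the function name and the example variants are hypothetical and not taken from the study.

```python
def sensitivity_and_precision(called, truth):
    """Return (sensitivity, precision) of a variant call set against a truth set.

    sensitivity = true positives / all true variants
    precision   = true positives / all called variants
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(called) if called else 0.0
    return sensitivity, precision


# Hypothetical truth set of five SNVs and a call set that recovers three of
# them and adds one false positive.
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr2", 900, "T", "G"),
    ("chr3", 42, "G", "A"),
}
called = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 75, "C", "T"),
    ("chr4", 10, "A", "G"),  # not in the truth set: false positive
}

sens, prec = sensitivity_and_precision(called, truth)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

    Missed variants lower sensitivity while spurious calls lower precision, which is why admixture can affect the two metrics to different degrees.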